71 research outputs found

    Model-based clustering of categorical data based on the Hamming distance

    A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then used as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, easing computation relative to customary reversible jump algorithms. When the number of components is fixed, the proposed model encompasses a parsimonious latent class model as a special case. Model performance is assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches.
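
    As a rough illustration of the kind of kernel the abstract describes, the sketch below evaluates a Hamming-based pmf of the assumed form p(x | c, λ) ∝ exp(−λ d_H(x, c)); the factorized normalizing constant and all function names are assumptions for illustration, not taken from the paper.

```python
import numpy as np
from itertools import product

def hamming_kernel_pmf(x, center, lam, n_levels):
    """Probability of categorical vector x under a Hamming-distance kernel.

    Assumes (hypothetically) p(x | center, lam) = exp(-lam * d_H(x, center)) / Z,
    where d_H counts mismatching attributes and Z factorizes over attributes:
    Z = prod_j (1 + (m_j - 1) * exp(-lam)) for m_j levels in attribute j.
    """
    x, center, n_levels = map(np.asarray, (x, center, n_levels))
    d_hamming = np.sum(x != center)  # number of mismatching attributes
    log_z = np.sum(np.log1p((n_levels - 1) * np.exp(-lam)))
    return np.exp(-lam * d_hamming - log_z)

# Toy check: probabilities over all configurations of two 3-level attributes sum to 1.
total = sum(hamming_kernel_pmf(x, (0, 0), lam=1.2, n_levels=(3, 3))
            for x in product(range(3), repeat=2))
print(total)  # ~1.0
```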

    Mixture modeling via vectors of normalized independent finite point processes

    Statistical modeling in the presence of hierarchical data is a crucial task in Bayesian statistics. The Hierarchical Dirichlet Process (HDP) is the foremost tool for handling data organized in groups through mixture modeling. Although the HDP is mathematically tractable, its computational cost is typically demanding, and its analytical complexity represents a barrier for practitioners. The present paper develops a mixture model based on a novel family of Bayesian priors designed for multilevel data and obtained by normalizing a finite point process. A full distribution theory for this new family and the induced clustering is developed, including tractable expressions for the marginal, posterior, and predictive distributions. Efficient marginal and conditional Gibbs samplers are designed to provide posterior inference. The proposed mixture model outperforms the HDP in terms of analytical tractability, clustering discovery, and computational time. The motivating application comes from the analysis of shot put data, which contain performance measurements of athletes across different seasons. In this setting, the proposed model is exploited to induce clustering of the observations across seasons and athletes. By linking clusters across seasons, similarities and differences in athletes' performances are identified.
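
    The sketch below is only a generic illustration of the normalization step the abstract names: positive jumps of a finite point process are divided by their sum to yield mixture weights. The shifted-Poisson number of points, the Gamma jumps, and all names are hypothetical choices, not the paper's prior family.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_finite_weights(rate=3.0, a=1.0, b=1.0):
    """Mixture weights obtained by normalizing a finite point process (illustrative).

    Hypothetical construction: the number of support points K is 1 + Poisson(rate)
    (shifted so the mixture is non-empty), the unnormalized jumps are i.i.d.
    Gamma(a, b), and the weights are the jumps divided by their total mass.
    """
    k = 1 + rng.poisson(rate)
    jumps = rng.gamma(shape=a, scale=1.0 / b, size=k)
    return jumps / jumps.sum()

weights = normalized_finite_weights()
labels = rng.choice(len(weights), size=20, p=weights)  # cluster allocations
print(weights.round(3), labels)
```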

    Gaussian graphical modeling for spectrometric data analysis

    Motivated by the analysis of spectrometric data, we introduce a Gaussian graphical model for learning the dependence structure among frequency bands of the infrared absorbance spectrum. The spectra are modeled as continuous functional data through a B-spline basis expansion, and a Gaussian graphical model is assumed as a prior specification for the smoothing coefficients to induce sparsity in their precision matrix. Bayesian inference is carried out to simultaneously smooth the curves and estimate the conditional independence structure between portions of the functional domain. The proposed model is applied to the analysis of infrared absorbance spectra of strawberry purees.
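
    A minimal sketch of the basis-expansion step the abstract relies on, assuming equally spaced knots, synthetic data, and a hypothetical identity precision matrix standing in for the paper's Gaussian graphical model prior; the penalized fit below only mimics a posterior-mean smooth.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis=10, degree=3):
    """B-spline design matrix with equally spaced interior knots on [min(x), max(x)]."""
    n_interior = n_basis - degree - 1
    interior = np.linspace(x.min(), x.max(), n_interior + 2)[1:-1]
    knots = np.concatenate([[x.min()] * (degree + 1), interior, [x.max()] * (degree + 1)])
    return np.column_stack([
        BSpline(knots, np.eye(n_basis)[j], degree)(x) for j in range(n_basis)
    ])

# Hypothetical spectrum: a smooth signal plus noise on a frequency grid.
freq = np.linspace(0.0, 1.0, 200)
y = np.sin(4 * np.pi * freq) + 0.1 * np.random.default_rng(1).normal(size=freq.size)

B = bspline_basis(freq)
Omega = np.eye(B.shape[1])  # stand-in for the GGM-induced precision on the coefficients
beta = np.linalg.solve(B.T @ B + Omega, B.T @ y)  # ridge-type posterior-mean smooth
smooth = B @ beta
print(beta.round(2))
```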

    Dynamic model-based clustering for spatio-temporal data

    In many research fields, scientific questions are investigated by analyzing data collected over space and time, usually at fixed spatial locations and time steps, resulting in geo-referenced time series. In this context, it is of interest to identify potential partitions of the space and study their evolution over time. A finite space-time mixture model is proposed to identify level-based clusters in spatio-temporal data and study their temporal evolution along the time frame. We account for space-time dependence by introducing spatio-temporally varying mixing weights that assign observations at nearby locations and consecutive time points similar cluster membership probabilities. As a result, a clustering varying over both time and space is obtained. Conditionally on cluster membership, a state-space model is deployed to describe the temporal evolution of the sites belonging to each group. Full posterior inference is provided under a Bayesian framework through Markov chain Monte Carlo algorithms. A strategy to select a suitable number of clusters based upon the posterior temporal patterns of the clusters is also offered. We evaluate our approach through simulation experiments and illustrate it using air quality data collected across Europe from 2001 to 2012, showing the benefit of borrowing strength across space and time.
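
    The sketch below shows one hypothetical way mixing weights could vary over space and time as the abstract describes: a softmax whose logits reward agreement with spatial neighbours and with a site's own previous membership. The functional form, the parameters phi and psi, and the toy neighbourhood structure are all assumptions.

```python
import numpy as np

def allocation_probs(z_prev, neighbors, n_clusters, phi=1.0, psi=1.0):
    """Spatio-temporally varying mixing weights (illustrative sketch).

    Hypothetical form: the logit of cluster k at site i is phi times the fraction
    of spatial neighbors in k at the previous time step, plus psi if site i itself
    was in k; a softmax then normalizes across clusters.
    """
    n_sites = len(z_prev)
    probs = np.zeros((n_sites, n_clusters))
    for i in range(n_sites):
        logits = np.zeros(n_clusters)
        for k in range(n_clusters):
            share = np.mean([z_prev[j] == k for j in neighbors[i]]) if neighbors[i] else 0.0
            logits[k] = phi * share + psi * (z_prev[i] == k)
        probs[i] = np.exp(logits - logits.max())
        probs[i] /= probs[i].sum()
    return probs

# Four sites on a line, two clusters, previous memberships [0, 0, 1, 1].
print(allocation_probs([0, 0, 1, 1], {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, 2))
```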

    Estimate of overdiagnosis of breast cancer due to mammography after adjustment for lead time. A service screening study in Italy

    INTRODUCTION: An excess in incidence rates is the expected consequence of service screening. The aim of this paper is to estimate the share attributable to overdiagnosis in the breast cancer screening programmes of Northern and Central Italy. METHODS: All patients with breast cancer diagnosed at ages 50 to 74 years who were resident in screening areas in the six years before and the five years after the start of the screening programme were included. We calculated a corrected-for-lead-time number of observed cases for each calendar year: the number of observed incident cases was reduced by the number of screen-detected cases in that year and incremented by the estimated number of screen-detected cases that would have arisen clinically in that year. RESULTS: In total, we included 13,519 and 13,999 breast cancer cases diagnosed in the pre-screening and screening years, respectively. The excess ratio of observed to predicted in situ and invasive cases was 36.2%. After correction for lead time, the excess ratio was 4.6% (95% confidence interval 2 to 7%); for invasive cases only it was 3.2% (95% confidence interval 1 to 6%). CONCLUSION: The remaining excess of cancers after individual correction for lead time was lower than 5%.
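
    The lead-time correction rule stated in METHODS reduces each year's observed count by that year's screen-detected cases and adds back those estimated to have surfaced clinically in that year. A direct transcription of that arithmetic, with hypothetical numbers:

```python
def corrected_observed(observed, screen_detected, would_have_surfaced):
    """Lead-time-corrected case count for one calendar year, per the stated rule."""
    return observed - screen_detected + would_have_surfaced

def excess_ratio(corrected, predicted):
    """Excess ratio of (corrected) observed to predicted incidence, in percent."""
    return 100.0 * (corrected / predicted - 1.0)

# Hypothetical counts for a single year:
corrected = corrected_observed(observed=1200, screen_detected=300, would_have_surfaced=220)
print(corrected, round(excess_ratio(corrected, predicted=1080), 1))
```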

    Breast screening: axillary lymph node status of interval cancers by interval year

    The aim of this study was to determine whether the excess risk of axillary lymph node metastases (N+) differs between interval breast cancers arising shortly after a negative mammography and those presenting later. In a registry-based series of pT1a–pT3 breast carcinoma patients aged 50–74 years from the Italian screening programmes, the odds ratio (OR) of having N+ for interval cancers (n = 791) versus screen-detected (SD) cancers (n = 1211) was modelled using forward stepwise logistic regression analysis. The interscreening interval was divided into 1–12, 13–18, and 19–24 months. The prevalence of N+ was 28% among SD cancers. With a prevalence of 38%, 42%, and 44%, the adjusted (demographics and N-staging technique) OR of N+ for cancers diagnosed at 1–12, 13–18, and 19–24 months of the interval was 1.41 (95% confidence interval 1.06–1.87), 1.74 (1.31–2.31), and 1.91 (1.43–2.54), respectively. Histologic type, tumour grade, and tumour size were entered in turn into the model. Histologic type had modest effects. With adjustment for tumour grade, the ORs decreased to 1.23 (0.92–1.65), 1.58 (1.18–2.12), and 1.73 (1.29–2.32). Adjusting for tumour size decreased the ORs to 0.95 (0.70–1.29), 1.34 (0.99–1.81), and 1.37 (1.01–1.85). The strength of confounding by tumour size suggested that the excess risk of N+ for first-year interval cancers reflected only their greater chronological age, whereas the increased aggressiveness of second-year interval cancers was partly accounted for by intrinsic biological attributes.
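
    For context, a crude (unadjusted) odds ratio with a Wald confidence interval can be computed from a 2x2 table as sketched below; the abstract's ORs come instead from an adjusted logistic regression, and the counts used here are hypothetical.

```python
import numpy as np

def odds_ratio_ci(n_pos_a, n_neg_a, n_pos_b, n_neg_b, z=1.96):
    """Crude odds ratio of N+ for group A vs group B with a Wald 95% CI.

    OR = (a*d)/(b*c); the CI uses the standard error of log(OR),
    sqrt(1/a + 1/b + 1/c + 1/d).
    """
    log_or = np.log((n_pos_a * n_neg_b) / (n_neg_a * n_pos_b))
    se = np.sqrt(1 / n_pos_a + 1 / n_neg_a + 1 / n_pos_b + 1 / n_neg_b)
    return np.exp(log_or), np.exp(log_or - z * se), np.exp(log_or + z * se)

# Hypothetical counts: N+/N- of 100/160 in interval vs 339/872 in screen-detected cancers.
print(odds_ratio_ci(100, 160, 339, 872))
```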

    Clinical epigenetics settings for cancer and cardiovascular diseases: real-life applications of network medicine at the bedside

    Despite the impressive efforts invested in epigenetic research over the last 50 years, clinical applications are still lacking. Only a few university hospital centers currently use epigenetic biomarkers at the bedside. Moreover, the overall concept of precision medicine is not widely recognized in routine medical practice, and the reductionist approach remains predominant in treating patients affected by major diseases such as cancer and cardiovascular diseases. By its very nature, epigenetics is integrative of genetic networks. The study of epigenetic biomarkers has led to the identification of numerous drugs with an increasingly significant role in clinical therapy, especially for cancer patients. Here, we provide an overview of clinical epigenetics within the context of network analysis. We illustrate achievements to date and discuss how we can move from traditional medicine into the era of network medicine (NM), where pathway-informed molecular diagnostics will allow treatment selection following the paradigm of precision medicine.

    Bayesian space-time data fusion for real-time forecasting and map uncertainty

    Environmental computer models are deterministic models designed to predict environmental phenomena such as air pollution or meteorological events. Numerical model output is given in terms of averages over grid cells, usually at high spatial and temporal resolution. However, these outputs are often biased, of unknown calibration, and not equipped with any information about the associated uncertainty. Conversely, data collected at monitoring stations are more accurate, since they essentially provide the true levels. Due to the leading role played by numerical models, it is now important to compare model output with observations. Statistical methods developed to combine numerical model output and station data are usually referred to as data fusion. In this work, we first combine ozone monitoring data with ozone predictions from the Eta-CMAQ air quality model in order to forecast in real time the current 8-hour average ozone level, defined as the average of the previous four hours, the current hour, and the predictions for the next three hours. We propose a Bayesian downscaler model based on first differences, with a flexible coefficient structure and an efficient computational strategy to fit the model parameters. Model validation for the eastern United States shows a substantial improvement of our fully inferential approach over the current real-time forecasting system. Furthermore, we consider the introduction of temperature data from a weather forecast model into the downscaler, showing improved real-time ozone predictions. Finally, we introduce a hierarchical model to obtain the spatially varying uncertainty associated with numerical model output. We show how we can learn about such uncertainty through suitable stochastic data fusion modeling using external validation data. We illustrate our Bayesian model by providing the uncertainty map associated with a temperature output over the northeastern United States.
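
    The 8-hour average targeted by the forecast is defined explicitly in the abstract; the helper below simply transcribes that definition (the function name and the sample values are illustrative).

```python
import numpy as np

def eight_hour_average(past_four, current, next_three_forecast):
    """Real-time 8-hour average ozone as defined in the abstract: the mean of the
    previous four hourly observations, the current hour, and the forecasts for
    the next three hours."""
    values = list(past_four) + [current] + list(next_three_forecast)
    assert len(values) == 8, "exactly eight hourly values are required"
    return float(np.mean(values))

# Hypothetical hourly ozone levels (ppb):
print(eight_hour_average([52, 55, 58, 60], 63, [61, 59, 56]))
```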

    Quantifying uncertainty associated with a numerical model output

    Environmental numerical models are deterministic tools widely used to simulate and predict complex systems. However, they are unsatisfactory in that they do not provide information about the uncertainty associated with their predictions. Conversely, uncertainty assessment of model outputs can guide environmental agencies in improving computer models. We propose a Bayesian hierarchical model to obtain the spatially varying uncertainty associated with a numerical model output and show how we can learn about such uncertainty through suitable stochastic data fusion modeling using external validation data. The model is illustrated by providing the uncertainty map associated with a temperature output over the northeastern United States.
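
    As a loose stand-in for the hierarchical model, the sketch below kernel-smooths squared validation residuals over space to produce a local standard deviation at each grid cell; this moment-based estimator, the bandwidth, and all names are assumptions rather than the paper's method.

```python
import numpy as np

def local_output_sd(station_xy, residuals, grid_xy, bandwidth=1.0):
    """Spatially varying uncertainty of a numerical model output (illustrative).

    Squared residuals (validation data minus model output at the stations) are
    smoothed with a Gaussian kernel over space, and the square root gives a
    local standard deviation per grid cell.
    """
    station_xy, grid_xy = np.asarray(station_xy), np.asarray(grid_xy)
    d2 = ((grid_xy[:, None, :] - station_xy[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    w /= w.sum(axis=1, keepdims=True)
    return np.sqrt(w @ (np.asarray(residuals) ** 2))

# Hypothetical stations, residuals (observation minus model), and two grid cells:
print(local_output_sd([[0, 0], [1, 0], [0, 1]], [0.5, -1.2, 0.8], [[0.2, 0.2], [1.0, 1.0]]))
```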